REMARKS
Claims 1-4 

Claims 1-4 were rejected under 35 U.S.C. § 103(a) as 
being unpatentable over Smith et al. (U.S. Patent Number 
6,408,271 B, hereinafter Smith) in view of Hab-Umbach et al.
(U.S. Patent Number 5,873,061, hereinafter Hab-Umbach). 

Smith discloses a system for generating possible 
pronunciations of a sequence of words. Under Smith, each word 
has many possible pronunciations. As a result, for a sequence of 
words, there are multiple possible combinations of these 
pronunciations. Smith selects the top N pronunciations for the 
sequence of words to store in a dictionary and to use during 
speech recognition. During speech recognition, a decoder 
compares input feature vectors to pronunciations in the 
dictionary to determine if any of the pronunciations match the 
user's speech. 

Hab-Umbach identifies sub-word units for a new word by 
averaging a plurality of utterances of the new word into a 
reference template. The reference template is then compared 
against stored phonetic models to select the sub-word units that 
most likely produced the reference template. Hab-Umbach also 
includes a grapheme-to-phoneme conversion that converts text into 
sub-word units. Hab-Umbach, however, does not score the sub-word 
units produced by the grapheme-to-phoneme conversion and does not 
select between the phoneme sequence produced by the grapheme-to- 
phoneme conversion and the phoneme sequence formed from the 
reference template based on a score for the grapheme-to-phoneme 
sequence.

Independent claim 1 provides a method of adding an 
acoustic description of a word to a speech recognition lexicon. 
Initially, the text of the word is converted into an 
orthographically derived acoustic description of the word. The 
orthographically derived acoustic description is then scored
based in part on a comparison between the orthographically 
derived acoustic description and a speech signal representing a 
user's pronunciation of the word. The speech signal is also used 
to identify a speech-based acoustic description of the word and a 
score for the speech-based acoustic description wherein the 
speech-based acoustic description is not associated with the text 
of the word. One of the orthographically derived acoustic 
description and the speech-based acoustic description is then 
selected as the acoustic description of the word based on the 
scores for the two acoustic descriptions. 
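
For illustration only, the final selection step described above can be sketched as follows. This sketch is not taken from the claims or the specification. The names ScoredDescription and select_description are hypothetical, and the assumptions that a higher score indicates a better match and that ties go to the orthographically derived description are made only for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScoredDescription:
    """Hypothetical container for one candidate acoustic description."""
    source: str              # "orthographic" or "speech" (illustrative labels)
    units: Tuple[str, ...]   # phoneme-like symbols, illustrative only
    score: float             # assumption: higher means a better match


def select_description(orthographic: ScoredDescription,
                       speech_based: ScoredDescription) -> ScoredDescription:
    """Pick one of the two candidates by comparing their scores.

    Assumption: ties are resolved in favor of the orthographic candidate.
    """
    if orthographic.score >= speech_based.score:
        return orthographic
    return speech_based


# Usage: the speech-based candidate wins because its score is higher.
text_candidate = ScoredDescription("orthographic", ("t", "ow", "m", "ey", "t", "ow"), -42.0)
speech_candidate = ScoredDescription("speech", ("t", "ax", "m", "aa", "t", "ow"), -37.5)
chosen = select_description(text_candidate, speech_candidate)
```

The point of the sketch is that both candidates carry their own score and that the choice depends on both scores.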

The combination of Smith and Hab-Umbach does not show 
or suggest the invention of claim 1 because neither reference 
shows or suggests selecting one of an orthographically derived 
acoustic description and a speech-based description based on the 
scores for the two acoustic descriptions. 

In the Office Action, it is asserted that column 12, 
lines 26-37 of Smith showed a step of selecting one of an 
orthographically derived acoustic description and a speech-based 
acoustic description based on the scores for these acoustic 
descriptions. Applicants respectfully dispute this assertion. 

The cited section discusses generating possible 
pronunciations for a sequence of words based on possible 
pronunciations for individual words in the sequence. All of the 
pronunciations are generated from text. As such, Smith cannot
take into consideration a score for a speech-based acoustic
description that is not associated with the text of the word, but
instead can only score acoustic descriptions that are based on
text.

Similarly, Hab-Umbach does not show or suggest 
selecting one of an orthographically derived acoustic description 
and a speech-based acoustic description based on a score for the 
orthographically derived acoustic description and a score for the
speech-based acoustic description. 

Since neither Smith nor Hab-Umbach shows a step of
making a selection based on the score of an orthographically 
derived acoustic description and the score of a speech-based 
description, their combination does not form the invention of 
claim 1. As such, claim 1 is not obvious from the combination of 
Smith and Hab-Umbach. 

Claim 5 

Claim 5 was rejected under 35 U.S.C. § 103(a) as being 
obvious from Smith in view of Hab-Umbach and further in view of 
Bahl et al. (U.S. Patent Number 5,875,426, hereinafter Bahl '426).
Bahl '426 provides a speech recognition system that is able to
handle word pronunciations that are context dependent. During 
recognition, Bahl '426 first considers all possible stored 
pronunciations for all words in a vocabulary. The speech signal 
is applied to these pronunciations to identify a set of candidate 
words. All of these pronunciations are associated with the text 
of the words. These candidate words are applied to a language 
model that generates a score for each current candidate word 
based on a previously identified word. This results in a ranked 
list of candidate current words and the dictionary-based 
pronunciations of those words. 

Bahl '426 then examines a field in each current word's
dictionary entry and a field in the preceding word's dictionary 
entry to determine if an additional pronunciation of the word 
should be added as a candidate. Note that this additional 
pronunciation candidate is a rule-based candidate associated with 
the text of the word and is not dependent on how the speaker 
pronounced the word. The speech signal is then applied to these 
candidate words and pronunciations in order to select a most 
likely word. 

Dependent claim 5 depends from claim 1 and includes a
further limitation wherein identifying a score for a speech-based
acoustic description further comprises using a language model.
Because claim 5 depends from claim 1, it includes the limitation 
to selecting one of the orthographically derived acoustic 
description and the speech-based acoustic description based on 
the score for the orthographically derived acoustic description 
and the score for the speech-based acoustic description. None of 
Smith, Hab-Umbach, or Bahl '426 show such a limitation. 

In particular, Bahl '426 does not show such limitation,
since it does not generate a speech-based acoustic description 
that is not associated with the text of the word. Since it does 
not generate such a speech-based acoustic description, it cannot 
score a speech-based acoustic description and as such cannot 
select one of an orthographically derived acoustic description 
and a speech-based acoustic description based on a score for a 
speech-based acoustic description. 

Since none of Smith, Hab-Umbach, and Bahl '426 show
selecting one of a speech-based acoustic description and an
orthographically derived acoustic description based on scores for
the acoustic descriptions, claim 5 is patentable over the
combination of Smith, Hab-Umbach, and Bahl '426.

Claims 6-8

Claims 6-8 were rejected under 35 U.S.C. § 103(a) as 
being obvious from Smith in view of Hab-Umbach and Bahl '426 and
further in view of Bahl et al. (U.S. Patent Number 6,377,921,
hereinafter Bahl '921).

Bahl '921 provides a system for identifying 
transcription errors in text used for training a speech 
recognition system. Bahl '921 trains a set of acoustic models 
for acoustic units such as words, syllables, and phones. After 
the training is complete, a speech signal is aligned with its 
corresponding transcript using the trained models and a score is 
determined for each acoustic unit in the transcript. Instances
of acoustic units that receive a low score from these models are 
then flagged and examined by a human operator to determine if the 
transcription is in error. 

Claims 6-8 depend indirectly from claim 1. As a 
result, they include the limitation to selecting one of an
orthographically derived acoustic description and a speech-based 
acoustic description based on the score for the orthographically 
derived acoustic description and the score for the speech-based 
acoustic description. The combination of Smith, Hab-Umbach, Bahl
'426, and Bahl '921 does not show or suggest this limitation.

As discussed above, Smith, Hab-Umbach, and Bahl '426
fail to show a selection based on both a score for an 
orthographically derived acoustic description and a score for a 
speech-based acoustic description. Similarly, Bahl '921 fails to 
show this limitation. Under Bahl '921, speech signals are applied to a
known transcription of the speech that is associated with the
text of the words. As such, Bahl '921 does not score a speech-
based acoustic description as found in claim 1. Therefore, it
cannot select an acoustic description based on a score for a
speech-based acoustic description. Thus, none of the cited 
references show a selection of an acoustic description based on 
both a score for an orthographically derived acoustic description 
and a score for a speech-based acoustic description. 

In addition, in claim 6, generating a score for a 
speech-based acoustic description includes generating a language 
model score for a sequence of syllable-like units. None of 
Smith, Hab-Umbach, Bahl '426, or Bahl '921 show or suggest 
generating a language model score for a sequence of syllable-like 
units.

In the Office Action, language model 18B of Bahl '921 
was cited as providing a language model score for syllable-like 
units. However, Bahl '921 never states that the language model
uses syllable-like units. As such, it does not show or suggest
generating a language model score for a sequence of syllable-like
units.

Since none of the cited references show or suggest 
generating a language model score for a sequence of syllable-like 
units and since none of the cited references select an acoustic 
description based on a score for an orthographically derived 
acoustic description and a score for a speech-based acoustic 
description, claim 6 and claims 7 and 8, which depend therefrom, 
are patentable over Smith, Hab-Umbach, Bahl '426, and Bahl '921.

Claims 9-11 

Claims 9-11 were rejected under 35 U.S.C. § 103(a) as
being unpatentable over Smith in view of Hab-Umbach and Bahl
'426, and further in view of Contolini et al. (U.S. Patent Number
6,233,553, hereinafter Contolini). 

Contolini provides a method of selecting one 
pronunciation from a set of text-based pronunciations. Under 
Contolini, a plurality of text-based pronunciations are formed 
from the spelling of a word using a transcription generator. The 
top N pronunciations are provided to a speech recognition system, 
which applies a speech signal to the transcriptions representing 
each pronunciation. The transcription that scores highest is 
selected for storage. Contolini does not identify a speech-based 
acoustic description from a speech signal where the speech-based 
acoustic description is not associated with the text of a word, 
nor does it show the production of an acoustic model score for a 
syllable-like unit by generating acoustic model scores for each 
of a sequence of phonemes that form the syllable-like unit. 

Claims 9-11 depend from claim 1 and as such include the 
limitation to selecting an acoustic description based on both a 
score for an orthographically derived acoustic description and a 
speech-based acoustic description. None of the cited references
show or suggest this limitation. In Contolini, a speech signal
is applied against previously identified transcriptions to
identify a score for each transcription. Since each of these 
transcriptions is associated with the text of the word, Contolini 
does not identify a speech-based acoustic description that is not 
associated with the text of a word. As such, Contolini can not 
select an acoustic description based on a score for a speech- 
based acoustic description. 

Since none of the cited references show a step of 
selecting an acoustic description based on a score for both an 
orthographically derived acoustic description and a speech-based 
description, claims 9-11 are patentable over the cited art. 

In addition, none of the cited references show or 
suggest generating an acoustic model score for a sequence of 
syllable-like units by generating acoustic model scores for each 
of a sequence of phonemes that form the sequence of syllable-like 
units as found in claim 9. 

In the Office Action, it was asserted that claim 4, 
column 7, line 6 and column 6, line 56 of Contolini show this 
limitation. Applicants respectfully dispute this assertion. 

Claim 4 of Contolini simply states that the sound units of its claim 1
are acoustic units. Neither claim 1 nor claim 4 makes any mention
of syllable-like units or of determining an acoustic score for a 
syllable-like unit by determining acoustic scores for a sequence 
of phonemes that form the syllable-like units. Column 6, line 56 
describes classes of phonemes including consonant and syllabic. 
This section does not suggest generating an acoustic score for a 
syllable-like unit by determining acoustic scores for a sequence 
of phonemes. Instead, it simply shows that a single phoneme may 
act as a syllable at times. When this occurs, forming an 
acoustic score for the syllable does not require determining the 
acoustic score for a sequence of phonemes. Instead, the acoustic 
score for a single phoneme is determined. 

Column 7, line 6 discusses filtering unlikely sequences
of phonemes. It does not show or suggest determining an acoustic 
score for a syllable-like unit by generating acoustic scores for 
each of a sequence of phonemes that form the syllable-like unit. 

Since none of the cited references show or suggest 
determining an acoustic score for a syllable-like unit by 
determining acoustic scores for a sequence of phonemes that form 
the syllable-like unit, the combination of these references does
not show or suggest claim 9. 

Claims 12-17 

Claims 12-17 were rejected under 35 U.S.C. § 103(a) as 
being unpatentable over Gupta et al. (U.S. Patent Number 
6,243,680, hereinafter Gupta) in view of Hab-Umbach. 

Gupta provides a system for selecting a pronunciation 
of a word for entry into a dictionary. Under Gupta, the text of 
a new word is first converted into a string of phonemes using a 
set of text-to-phoneme rules 412. These phonemes are placed in a 
graph structure with each branch in the structure being 
represented by a different phoneme. For each phoneme branch, a 
set of parallel branches are constructed, one for each phoneme 
that is similar to the initial phoneme in the graph. Additional 
parallel branches are then added for each allophone of each 
phoneme in the graph where an allophone is a particular 
pronunciation of a phoneme. Gupta then applies a set of speech 
utterances to the graph to score each path through the graph. 
The path with the highest score is selected as the pronunciation 
of the word. 

Independent claim 12 provides a computer-readable 
medium having instructions for selecting a phonetic description 
of a word to add to a lexicon. These steps include receiving the 
text of the word and a speech signal representing a person's 
pronunciation of the word. The text of the word is converted 
into a text-based phonetic description while the speech signal is 
used to generate a speech-based phonetic description of the word
without using the text of the word. Either the text-based 
phonetic description or the speech-based phonetic description is 
then selected for entry in the lexicon based on the 
correspondence between each phonetic description and the speech 
signal.

The combination of Gupta and Hab-Umbach does not show 
or suggest the invention of claim 12 because the combination does 
not include a step of selecting between a text-based phonetic 
description and a speech-based phonetic description. 

In the background section of Gupta, different types of 
systems for identifying pronunciations of new words are 
described. In one system, an expert listens to the word and 
identifies the acoustic description. In a separate system, a 
continuous allophone recognizer is used that decodes speech 
utterances to identify an acoustic description that is not 
associated with a word. In another system, a set of text-based 
rules are used to form an acoustic description. 

However, Gupta does not show or suggest determining an 
acoustic description from the text and an acoustic description 
from the speech signal and then selecting between the two 
acoustic descriptions. Instead, text-based acoustic descriptions 
are used in separate systems from speech-based acoustic 
descriptions.

Note that in the Gupta system itself, only text-based 
acoustic descriptions are used. Specifically, "[t]he feature 
vectors for each utterance are used to score the allophonic graph 
generated on the basis of the orthographic representation of the 
new word." (Gupta, col. 13, lines 61-63). Thus, graph scoring 
unit 404 does not generate a speech-based phonetic description 
that does not use the text of a word, but simply scores the text- 
based phonetic descriptions proposed by graph generator 400. 

The fact that Gupta does not produce a speech-based
phonetic description can be seen clearly by removing all of the 
phonetic descriptions that use text. If this is done, allophone 
graph generator 400 produces an empty graph because the graph is 
only populated using letter-to-phoneme rules (see col. 5, lines
24-39). This empty graph is provided to graph scorer 404, which
is then unable to function since it does not have any phonetic 
sequences to apply the speech signal against. If Gupta produced 
a speech-based phonetic description, this would not be true since 
the speech-based phonetic description would still be present even 
if the text-based phonetic descriptions were removed. 

As noted above, Hab-Umbach also fails to show the 
selection between a text-based phonetic description and a speech- 
based phonetic description based in part on the correspondence 
between each phonetic description and a representation of a 
speech signal. 

Since neither Gupta nor Hab-Umbach selects between a
text-based phonetic description and a speech-based phonetic 
description based on the correspondence between the phonetic 
descriptions and a speech signal, their combination does not show 
or suggest the invention of claim 12 or claims 13-17, which 
depend therefrom. 

Claim 18 

Claim 18 was rejected under 35 U.S.C. § 103(a) as being 
unpatentable over Gupta in view of Hab-Umbach and further in view 
of Contolini. Claim 18 depends from claim 12 and thus includes 
the limitation to selecting between a text-based phonetic 
description and a speech-based phonetic description based in part 
on the correspondence between each phonetic description and a 
representation of a speech signal. None of the cited references 
show this limitation. 

In particular, Contolini does not show or suggest
selecting between a text-based phonetic description and a speech-
based phonetic description, because it does not show or suggest
producing a speech-based phonetic description from a speech 
signal without using the text of the word. 

Since none of the cited references select between a 
text-based phonetic description and a speech-based phonetic 
description based on a correspondence between the phonetic 
descriptions and a speech signal, the combination of these 
references does not show or suggest the invention of claim 18. 

Claims 19-21 

Claims 19-21 were rejected under 35 U.S.C. § 103(a) as 
being obvious from Schulze (U.S. Patent No. 6,167,369) in view of 
Gupta. 

Schulze describes a system for determining the language 
of a document. To do this, Schulze generates a set of trigram 
models for each language, where each trigram model provides the 
probability of a character trigram in the language. An input 
text is then divided into trigrams. The trigrams for the input 
text are scored using the models for each language to generate a 
total score for each language. Schulze does not show or suggest 
syllable-like units or forming n-grams of syllable-like units. 

Independent claim 19 provides a speech recognition 
system with a language model that is trained through a series of 
steps that include breaking each word in a dictionary into 
syllable-like units and for each word, grouping the syllable-like 
units into n-grams. The total number of n-gram occurrences in 
the dictionary is counted and for each n-gram, the total number 
of occurrences of the particular n-gram is divided by the total 
number of n-gram occurrences in the dictionary to form a language 
model probability for the n-gram. 
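
For illustration only, the counting and division steps summarized above can be sketched as follows. This sketch is not taken from the claims or the specification. The function name ngram_probabilities is hypothetical, the segmentation of each word into syllable-like units is assumed to have been performed already, and the choice of n = 2 in the usage example is arbitrary.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def ngram_probabilities(segmented_words: Iterable[Sequence[str]],
                        n: int = 2) -> Dict[Tuple[str, ...], float]:
    """Estimate n-gram probabilities over syllable-like units.

    Each element of segmented_words is one dictionary word, already broken
    into syllable-like units (an assumption of this sketch).
    """
    counts: Counter = Counter()
    for units in segmented_words:
        # Group the units of this word into overlapping n-grams.
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1

    total = sum(counts.values())  # total n-gram occurrences in the dictionary
    if total == 0:
        return {}
    # Occurrences of each particular n-gram divided by the total occurrences.
    return {gram: count / total for gram, count in counts.items()}


# Usage with a toy, hand-segmented dictionary (illustrative data only).
toy_dictionary = [["ta", "ble"], ["ta", "ble", "top"]]
probs = ngram_probabilities(toy_dictionary, n=2)
```

In the toy dictionary, the bigram ("ta", "ble") occurs twice and ("ble", "top") occurs once, so three n-gram occurrences are counted in total.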

The combination of Schulze and Gupta does not show or 
suggest the invention of claim 19. In particular, neither 
reference shows or suggests grouping syllable-like units found in 
dictionary words into n-grams. 

In the Office Action, it was asserted that Schulze
shows grouping syllable-like units from dictionary words into
n-grams at column 1, line 29. Applicants respectfully dispute this
assertion.

The cited section of Schulze discusses dividing an 
input sentence into individual character trigrams. It does not 
mention syllable-like units or forming n-grams from syllable-like 
units. Furthermore, it would not be obvious to use syllable-like 
units with Schulze. One goal of the Schulze system is to be able 
to identify the language of short text segments. If larger units 
were used instead of individual characters, there would be fewer 
n-gram probabilities calculated for short text segments thereby 
making it more difficult to identify the language of the text. 

In the Office Action, it was asserted that the sub-word 
units of Gupta correspond to a syllable-like unit. Thus, the 
rejection appears to be based on substituting the sub-word units 
of Gupta in the technique described by Schulze. 

However, those skilled in the art would not make such a 
substitution. Under Schulze, the language of the text is 
unknown. Because of this, it would be very difficult and in some 
cases may be impossible to divide the words into syllable-like 
units. In fact, for some languages in Schulze, the text can not 
even be divided into words. (See Schulze Col. 15, lines 50-53). 
Thus, those skilled in the art would not apply the sub-words of 
Gupta to Schulze as suggested by the Examiner. As such, claim 19 
and claims 20 and 21, which depend therefrom, are patentable over 
the combination of Gupta and Schulze. 

In the Office Action, the arguments above were not 
deemed to be persuasive. In particular, the Examiner stated that 
Applicants' argument that Schulze is not applicable because the
language is unknown is not persuasive because, "The fact that 
Applicant has recognized another advantage which would flow 
naturally from following the suggestion of the prior art can not 
be the basis for patentability when the differences would
otherwise be obvious." Applicants have in fact not found another 
advantage that would naturally flow from following the suggestion 
of the prior art. Instead, Applicants have found a technique 
that is incompatible with Schulze. In particular, Applicants are
pointing out that those skilled in the art would not modify 
Schulze to use syllable-like units, as suggested by the Examiner, 
because the language in Schulze is unknown and as a result it 
would be very difficult and in some cases impossible to divide 
words into syllable-like units. Thus, the substitution suggested 
by the Examiner would not be performed by those skilled in the 
art.

Claims 20 and 21 are additionally patentable over 
Schulze and Gupta. In claim 20, the dictionary words are broken 
into syllable-like units by preferring syllable-like units that 
occur more frequently in the dictionary than other syllable-like 
units. Neither Schulze nor Gupta shows or suggests this additional
limitation. 

In the Office Action, it was asserted that Schulze 
showed preferring syllable-like units that occur more often at 
column 12, lines 35-37. However, the cited section does not 
discuss syllable-like units or providing a preference for certain 
speech units when dividing a dictionary word into speech units. 
Instead, the cited section states that trigrams with low 
frequency counts are discarded from a trigram array. 

Trigrams found in a corpus cannot be given a preference 
during the search for the trigrams. The reason for this is that 
there is no latitude in how trigrams are identified in a word. 
Under Schulze, the trigrams are identified simply by selecting 
three characters in a row in a word. Just because one three- 
character sequence is later removed from the array does not 
influence the identification of the trigrams in the words. All 
of the trigrams are identified regardless of which ones are later
discarded from the array. 
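
For illustration only, the lack of latitude in identifying character trigrams can be sketched as follows. This is not code from Schulze. The function names char_trigrams and prune_low_counts and the threshold value are hypothetical, and the sketch reflects only the behavior described above: three consecutive characters are taken at every position, and low-count trigrams are discarded afterward.

```python
from collections import Counter
from typing import Dict, List


def char_trigrams(word: str) -> List[str]:
    """Return every run of three consecutive characters in the word.

    The result is fully determined by the characters of the word, so there
    is no choice to make between alternative divisions.
    """
    return [word[i:i + 3] for i in range(len(word) - 2)]


def prune_low_counts(counts: Dict[str, int], min_count: int) -> Dict[str, int]:
    """Discard trigrams whose count falls below a hypothetical threshold."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Identification happens first and is unaffected by the later pruning.
words = ["speech", "speed"]
identified = [char_trigrams(w) for w in words]
counts = Counter(gram for grams in identified for gram in grams)
kept = prune_low_counts(counts, min_count=2)
```

Removing low-count entries from the array changes which counts are retained, but the list of trigrams identified in each word is the same before and after pruning.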

Since the rules for identifying trigrams do not allow a 
preference to be applied so that one trigram is preferred over 
another during trigram identification, the cited section of 
Schulze cannot show or suggest preferring syllable-like units 
that occur more often in a dictionary over other syllable-like 
units when dividing words into syllable-like units. As such, the 
combination of Gupta and Schulze does not show or suggest the 
invention of claims 20 and 21. 

In the Office Action, the argument above was not deemed 
to be persuasive because it was asserted that it relied upon a 
feature that is not recited in the rejected claims. In 
particular, the Office Action made reference to the "rules for 
identifying trigrams" as not being recited in the claims. While 
Applicants agree that rules for identifying trigrams are not 
found in claims 20 and 21, Applicants also note that reference to 
these rules was not made to imply that these limitations were 
found in the claims. Instead, the rules for identifying trigrams
that are used by Schulze clearly indicate
that Schulze does not include the limitations of claims 20 and 
21, which include preferring syllable-like units that occur more 
frequently in the dictionary over syllable-like units that occur 
less frequently. The argument above simply indicates that this 
limitation is not shown by Schulze based on the rules for 
identifying trigrams provided by Schulze. 

Conclusion 

In light of the above remarks, claims 1-21 are 
patentable over the cited art. Reconsideration and allowance of 
the claims are respectfully requested.

The Director is authorized to charge any fee deficiency 
required by this paper or credit any overpayment to Deposit 
Account No. 23-1123. 